A molecular barcode and web-based data analysis tool to identify imported Plasmodium vivax malaria

Traditionally, patient travel history has been used to distinguish imported from autochthonous malaria cases, but the dormant liver stages of Plasmodium vivax confound this approach. Molecular tools offer an alternative method to identify, and map imported cases. Using machine learning approaches incorporating hierarchical fixation index and decision tree analyses applied to 799 P. vivax genomes from 21 countries, we identified 33-SNP, 50-SNP and 55-SNP barcodes (GEO33, GEO50 and GEO55), with high capacity to predict the infection’s country of origin. The Matthews correlation coefficient (MCC) for an existing, commonly applied 38-SNP barcode (BR38) exceeded 0.80 in 62% countries. The GEO panels outperformed BR38, with median MCCs > 0.80 in 90% countries at GEO33, and 95% at GEO50 and GEO55. An online, open-access, likelihood-based classifier framework was established to support data analysis (vivaxGEN-geo). The SNP selection and classifier methods can be readily amended for other use cases to support malaria control programs.


Analysis Overview
An overview of the analysis plan for this study is outlined as a flowchart presented in Figure 1 of the main text. In brief, the analytical process involved three key steps: 1. Data preparation This step entailed a series of sample and SNP filtering steps to obtain suitable genomic datasets for downstream analysis. The filtering steps aimed to derive an optimal balance of maximum numbers of samples and SNPs against minimum genotype missingness (i.e., positions with no reads).

SNP selection
This step entailed selection of SNPs using DecisionTree and HFST approaches to obtain panels with the minimal numbers of SNPs that could effectively classify infections by country at the different thresholds applied using the classifier developed in this study.

Evaluation and assessment
This step entailed evaluation of the selected SNP sets and the classifiers with stratified k-fold crossvalidation. In addition to evaluation with no missing data, the performance of the selected SNP sets was further assessed at varying levels of missing data (i.e., genotype failures). Lastly, the performance of the selected SNP sets was assessed using an independent dataset.
An online, open-access platform was then developed for end-users to readily analyze data generated at the selected SNP sets to predict the country of origin.

Dataset
The dataset used in the analysis included genomic data on P. vivax isolates derived from the Malaria Genomic Epidemiology (MalariaGEN) P. vivax Genome Variation Project Pv4 open dataset, which were collected from 26 countries 1 . All samples were processed within the framework of the MalariaGEN P. vivax Genome Variation Project. Briefly, all new isolates were subjected to whole genome sequencing at the Wellcome Sanger Institute using standard library preparation and sequencing on the Illumina HiSeq platform to generate 150bp paired end reads. The new data was combined with publicly available data, aligned against the P. vivax PvP01 reference 2 and subjected to variant calling and annotation according to best practices in the MalariaGEN framework 1 . The resultant VCF files, as well as the accompanying metadata for the Pv4 open dataset are available in the MalariaGEN repository (https://www.malariagen.net/resource/30).
For the analysis in this study, the dataset was divided into two parts, the training set and the validation set. The training set consisted of all Pv4 isolates, excluding a set of samples from the GSK study that were reserved for the validation set. Specifically, the validation set consisted of all isolates from the GSK study except for those from the Philippines. The decision to use the specified GSK study samples for validation was based on convenience owing to the later timing at which sequencing data from the GSK study were made available.

Data Preparation
An overview of the data preparation steps is outlined in section a. Data Preparation of the flowchart presented in Figure 1.
Prior to data filtering, all Pv4 VCF files from the MalariaGEN repository were reprocessed to set a minimum read depth of 1. Instead of using the GT field, individual genotype calls were redefined as heterozygotes based on an arbitrary threshold of a minor allele read ratio > 0.1 and a minimum of 2 reads for each allele; all other genotype calls were defined as homozygous for the major allele. All VCF processes were performed using BCFTools and custom python scripts. All consecutive analysis used the genetic data from this reprocessed VCF file.
The training dataset was filtered initially to exclude recurrent infections in the same individuals based on the Pv4 metadata information and to include only samples to which we had access (data published or provided with permission from the sample contributors as our study was undertaken prior to the open data release). All samples whose countries were represented by less than 4 genomic samples (arbitrary threshold) were also removed. In total, this filtering process resulted in an initial dataset comprising 1,348 samples as presented in Supplementary Table 1, from 21 countries as shown in Supplementary Figure 1.
With this dataset, from the initial 2,671,112 discovered variants provided in Pv4, we derived a set of 662,641 high-quality bi-allelic SNPs with VQSLOD score > 0 and minimum Minor Allele Count (MAC) of 2 by employing BCFTools. This dataset with 1,348 samples and 662,641 was denoted as Dataset 0.
Dataset 0 was then subjected to further filtering to remove non-independent samples, arbitrarily defined as isolate pairs with genetic distance less than 0.001 (0.1%). For this and consecutive analysis, genetic distances were computed as proportion of nucleotide or allele differences, with heterozygous calls considered as identical to homozygote calls and any missing genotype was omitted from the calculation. Amongst non-independent sample pairs, the isolate with the higher percentage of genotype failures was removed from the dataset; this removal process was iterated until all non-independent isolates had been removed from the dataset. This filtering process employed a custom Python script and resulted in 1,227 samples with 662,641 SNPs denoted as Dataset 1. The following pseudocode illustrates this filtering process: Prepare dataset X Calculate pairwise genetic distance based on proportion of allele differences WHILE there are pairs with genetic distance < 0.001 Get the pair with lowest genetic distance Remove the isolate of the pair with higher missingness from dataset X Dataset 1 was then subjected to iterative data quality filtering by removing samples and SNPs to obtain the best representative number of samples and informative SNPs, designated as SNPs with MAC ≥ 2, with no genotype missingness. The following pseudocode illustrates the process of obtaining the optimal number of samples and informative SNPs: Prepare dataset X Calculate sample missingness of dataset X Sort sample set in descending order of missingness REPEAT Calculate the number of informative SNPs without any missingness in dataset X Collect the number of samples and number of informative SNPs above Remove sample with highest missingness from sample set and dataset X UNTIL no more sample in sample set and dataset X Plot the collected number of samples vs the number of informative SNPs Plot the collected number of samples vs the changes of the number of informative SNPs The rationale to perform the above filtering process is that while removing relatively lower quality samples from the dataset will increase the number of SNPs without any missing genotype data, the sample removal process might decrease the number of informative (or non-monomorphic) SNPs. Supplementary Figure 2 provides a plot of the number of filtered samples against the number of filtered SNPs and the differences in the number of SNPs (delta SNPs). Based on this plot, with informative SNPs defined as SNPs with MAC ≥ 2, we identified 826 samples and 229,317 SNPs as the optimum for inclusion in Dataset 2.
The isolates in Dataset 2 were initially assigned the country in which the patient presented at the clinic with the infection based on the Pv4 metadata. This initial country-level assignment was evaluated further using 1) country-level prediction derived from the genome-wide data classification with the BALK classifier (described in the next section) and 2) manual confirmation by constructing a neighbor-joining tree. Briefly, the BALK classifier was trained with the Dataset 1 against all 229,317 SNPs, and then each isolate of the Dataset 2 was classified by the trained BALK. For the manual confirmation using neighborjoining tree, a genetic distance matrix was computed from Dataset 1 and a neighbor-joining tree was constructed using the Ape R library 3 . Isolates whose country assignment differed from the prediction result of the trained BALK classifier and that were not in the same country cluster as observed manually from the neighbor-joining tree were considered suspected imported infections and removed from the dataset to produce Dataset 3, comprising 799 samples and 229,317 SNPs.
For comparative assessment of candidate SNP panels, a new dataset (Dataset 4) was produced which comprised the samples in Dataset 3, but only with the SNPs selected by the consecutive SNP selection process and the Broad barcode panel; a SNP barcode panel which consisted of 42 SNPs that had been developed by the Broad Institute team to fingerprint Plasmodium vivax and which has already been implemented in several studies [4][5][6] . Whilst the original Broad barcode comprises 42 SNPs, we were only able to use 38 of these SNPs, herein referred to as BR38, since 4 SNPs were either non-biallelic within the Pv4 dataset or could not be genotyped using amplicon sequencing (i.e., were non-assayable) 4 .
A similar filtering process was applied to the validation set (comprising all the GSK study samples except those from the Philippines). Recurrent infections, inferred from the Pv4 metadata, were also removed from the validation set. However, instead of obtaining the optimal number of samples and informative SNPs as for the training set, we directly included the same 229,317 SNPs used in the training set. Any nonindependent samples were then removed using the same 0.001 threshold of genetic distance, using a similar procedure to that applied to the training set. Country-level assignment was assessed using the same trained BALK classifier as the training set, and a neighbor-joining tree was constructed by combining with Dataset 3 for manual confirmation, resulting in 142 samples in the validation set. Supplementary Figure 3 presents a neighbor-joining tree generated on Dataset 3 combined with the 142 validation samples using the 229,317 SNPs. Further SNP filtering was then applied to only include the 38-SNP Broad barcode (BR38) and the SNPs selected by the SNP selection process, producing the Independent Validation Dataset.

Overview of required classifier
In the machine learning field, a classifier is an algorithm that can be used to assign a class label to a data input based on a set of specific features. In this study, for the purpose of identifying the country of origin of a sample (essentially a P. vivax genome), the class is the country of origin, the features are a set (or panel) of SNP positions (markers), and the data input is the allele at each SNP position. This section describes the processes underlying the decision to apply the Bi-Allele Likelihood Classifier in our study (i.e., as the vivaxGEN-geo classifier), including comparative evaluation against an alternative classifier, namely Bernoulli Naïve Bayes.
For the purposes of classifying a sample by its country of origin, we defined the following requirements for the vivaxGEN-geo classifier: 1. Capable of evaluating existing SNP panels. The rationale was to enable continued use of SNP panels which have already been established in malaria-endemic settings as establishing SNP panels is not a trivial process.
2. Amenable to new SNP additions (extensions) to existing SNP panels. The rationale was that existing SNP panels can be readily optimized (rather than completely replaced) with the addition of new SNPs as they are identified, for example, as new genomic data reflecting new countries or enhancing existing sample size are identified, or as genetic shifts are observed over time.
3. Able to analyze data inputs containing heterozygous genotype calls or genotype fails .
Polyclonal infections are common in some malaria-endemic settings, and may exhibit heterozygous genotype calls (i.e., two or more alleles at a SNP position). Particularly in infections with low parasite densities, genotype failures (i.e., missing data at a given SNP position) may also be observed.
4. Able to provide confidence values of the prediction from input data. The rationale was that the end user (such as a malaria control program officer) can identify when a prediction has low confidence and is therefore able to exercise caution in interpretation of the output.
We identified the Naive Bayes classifier as having the potential to address our classifier requirements after application of several modifications.

Overview of Naive Bayes classifier
Naive Bayes is a well-described classifier, whose classification rule follows the Bayes Theorem 7 : In Bayesian terminology, Pr(C|X) is the posterior probability of being class C, Pr(C) is the prior probability of being class C and the Pr(X|C) is the likelihood of being class C given a set of data X, where C is a class and X is the set of data {x 1 , x 2 , …, x n } representing each of n feature to be used for classifying the data. With genomic data, a feature will be represented by a SNP marker or position and the data is the allele in that position, so that X is {allele at marker-1, allele at marker-2, …, allele marker-n}. Pr(X) is the probability that the data input X occurs in the set, and since it will be identical regardless of the class, it can be dropped from the classification rule to yield: max Pr (C |X )∼Pr (C )Pr (X |C ) The prior probability Pr(C) is usually calculated from the proportion of samples of class C over total training data, which presumptively represent the proportion of the circulating samples of class C over total population. However, for genomic data from various regions and studies, the proportion of samples in the training dataset does not necessarily represent the proportion of the circulating samples over total population, but reflects on the quality of processed samples, the number of studies performed on a region, and many other factors. Hence, for the purpose of classifying by country of origin, the prior probability will be set to uniform distribution (uniform values for all countries) so it can be dropped from the equation and that the classification rule only depends on the likelihood part of the formula, which effectively makes the classification rule equivalent to maximum likelihood: Naive Bayes can handle data with missingness by omitting the features having the missing data from the corresponding sample. For example, if the 3 rd feature out of 5 features has missingness, then the calculation for Pr(X|C) of this sample would be: Assumption on Naive Bayes is that each feature is independent, hence the naiveness of this algorithm. In reality, each feature is not really independent between each other. In practice, it has been shown that even though this assumption is frequently violated, Naive Bayes still delivers good prediction performance 8 .

Classifying biallelic SNP data with Bernoulli Naive Bayes (BNB)
Binomial-based Naive Bayes is a Naive Bayes whose likelihood is a binomial probability distribution. The binomial distribution with parameters n and p is the discrete probability distribution of the number of getting one of the two possible outcomes (such as the head or tail of a coin) in a sequence of n independent experiments, with p as the probability of getting one of the outcomes. Bernoulli Naive Bayes (BNB) is a well-known Naive Bayes classifier whose likelihood formula is the binomial n=1 distribution known as Bernoulli distribution, and is being used when the input data comprises values of 0 or 1.
Biallelic SNP data can be represented by 2 states, reference allele and alternate allele, so it is suitable to be used with binomial-based Naive Bayes. A haploid sample, which only has a single set of chromosomes, can have either 0 alternate allele (when the allele is reference allele) or 1 alternate allele in a single SNP position. Obtaining the allele of a haploid sample from a SNP position can be thought of as a single independent experiment, which corresponds to a binomial probability distribution with N=1 or Bernoulli distribution. Hence, haploid samples can be classified with BNB.
The classification rule for BNB used for biallelic haploid SNP data after omitting the prior probability is: where p i is the proportion of alternate allele at position i for country C and x i is the number of occurrences of alternate allele at position i in the input data, with maximum x i = 1.

Classifying with Biallele Likelihood (BALK) classifier
One of the requirements for the classifier in this analysis is the ability to analyze heterozygous alleles. Heterozygous alleles can arise in samples having polyclonal infection of 2 or more strains. If each sample can be represented as diploid with 2 sets of chromosomes regardless of its polyclonality, then obtaining the allele from a SNP position out of those 2 sets of chromosomes can be thought as 2 consecutive independent experiments, and hence a binomial distribution with N=2 can be used for the outcome probability of this data. The SNP data then can be stated as the number of occurrences of alternate allele in each respective position with homozygous reference allele stated as 0 (0 alternate allele, 2 reference alleles), heterozygous as 1 (1 alternate allele, 1 reference allele), and homozygous alternate allele as 2 (2 alternate alleles, 0 reference allele).
By using a binomial distribution with N=2, the classification rule for biallelic SNP allowing heterozygous call therefore becomes: Omitting the binomial coefficient, as the number would be identical across all classes, then: where p i is the proportion of alternate allele at position i for country C and x i is the number of occurrences of alternate allele in i position, with max x i = 2. We denote this derived classifier as Bi-Allele Likelihood classifier (BALK) to emphasize that it is suitable for bi-allelic SNP data from diploid samples (or any samples that can be treated as diploid for analytic purposes such as P. vivax clinical samples) and that the classifier only depends on the likelihood part of the Naive Bayes equation. In essence, BALK is a N=2 Binomial Naive Bayes whose prior probability is dropped from the classification rule.

Evaluating the performance of BNB and BALK classifiers
To illustrate the performance of BNB and BALK, a illustrative 4-population dataset consisting of samples from Colombia, Ethiopia, Indonesia and Thailand, each with 30 randomly-selected samples and representing different continents, was constructed from the high quality data training set (Dataset 3), with heterozygous SNPs set to their major alleles for BNB. We used BR38 SNP panel which, although not designed for country classification, was good enough for the purpose of comparing BNB and BALK.
The classification performance was measured using the MCC (Matthew Correlation Coefficient) scores for each country. The MCC provides a measure of the quality of the classifications, ranging from -1 (total disagreement) to 1 (perfect prediction). In addition to country-level MCC scores, pooled (cross-country) median and minimum MCC scores were also determined.
For comparative analysis of classification performance, we employed a stratified k-fold cross-validation method. K-fold cross-validation is a resampling procedure used to evaluate the performance and model over-fitting of an untrained classifier on a limited dataset, where model here refers to the feature set, classifier and any parameter needed for the classifier. The stratified k-fold cross-validation method splits samples to k groups while ensuring that each group contains approximately the same proportion of samples of each target class (hence, stratified) and then use one of the group as testing set and the rest of the groups as training set iteratively and consecutively. When the number of samples for a class are less than k, the samples are replicated as necessary so that the sample number is equal or more than k. This cross-validation process can be repeated to assess the performance with different member of each group.
The following pseudocode illustrate the generic steps of repeated stratified k-fold cross-validation combined with MCC scoring above: Prepare dataset X SET random generator with a seed REPEAT Split samples to k group with random stratification strategy FOR each group SET the group as testing set SET the remaining groups as training set Train classifier with training set Predict testing set using the trained classifier Calculate MCC score for each class from the prediction Collect all calculated MCC scores Calculate median MCC score from the collected MCC scores Calculate minimum MCC score from the collected MCC scores UNTIL r repeats The first step in the pseudocode, which is to prepare random generator with a seed, ensures that the iterative random splitting of the dataset will result in the same splits when this procedure is being used for different classifier or different feature sets and the random seed is identical. By following the above procedure, for each repeat, there will be a median MCC score, a minimum MCC score and k MCC scores for each class. These MCC scores then can be visualized using proper plot types such as boxplots.
To compare the performance of BNB and BALK, a stratified 5-fold cross-validation with the 4-population dataset and BR38 panel was performed for 100 repeats on both BNB and BALK. MCC scores for each population were plotted in the following figure: The model in the figure denoted the corresponding trained classifiers against BR38 using the 4-population dataset. The MCC scores indicated that BALK performed either equivalently or better than BNB for certain populations, such as the Indonesian samples. The Indonesian samples have been previously shown to have high levels of heterozygosity (i.e., polyclonal infections), with the models demonstrating that BALK could effectively capture the complexity in these populations.

SNP Selection
An overview of the SNP selection steps is outlined in section b. SNP Selection of the flowchart presented in Figure 1. The dataset was subjected to the following SNP selection approaches: DecisionTree, and both HFST-0.90 and HFST-0.95 (HFST with Fst threshold of 0.9 and 0.95 respectively), described in detail in the following section. For each approach, we used Dataset 3 for both training and testing set and then collected the top-25 SNP sets, ranked based on the average of their minimum MCC and median MCC scores. The top-25 SNP sets were then subjected to the repeated stratified 10-fold cross-validation, and re-ranked based on their average minimum MCC and median MCC scores to obtain the best-performing SNP set for each approach.

SNP selection using DecisionTree
DecisionTree (DT) works by selecting the best features given the data set and then using the selected features to perform the classification simultaneously. The features selected by DT can also be used by other algorithms, such as Naive Bayes, to be used for training and inferring the class of each sample. It should be noted that selected features which give the best prediction scores (MCC) with DT do not necessarily give the best prediction score with Naive Bayes. Likewise, selected features which give non-optimal prediction scores with DT do not necessarily mean that those features will give mediocre prediction scores with Naive Bayes. Hence, iterative or repetitive DT should be performed to obtain sets of selected features which then can be validated and scored using Naive Bayes classifier, or specifically BALK for this study.
For this study, the DecisionTree (DT) was not chosen as a classifier, only as a SNP selector. The rationale is that DT does not work effectively with missing data without introducing bias and has difficulties with heterozygous SNP data. To use DT, all heterozygous SNPs must be converted to their major alleles, which would reduce the actual diversity of the data and might also be problematic when one cannot confidently call the major allele from the experimental data. Working with missing data with DT means that either (i) the features with missing data or the samples with missing data are removed from the overall analysis, (ii) the missing data is treated as one of the possible outcomes, or (iii) the missing data is imputed or substituted using other methods, most likely based on probabilistic methods. All these methods would incur added biases to the original data. On the other hand, Naive Bayes classifiers such as BALK are already based on probabilistic methods and features with missing data can be omitted at individual sample to not incurring other significance biases.
The following pseudocode illustrates the generic steps in selecting the features (or SNPs) using DecisionTree and scoring the features using BALK: In this study, we subjected Dataset 3 to the above procedure with 500 repeats using a custom Python script which utilized DecisionTree implementation from sklearn library version 0.24.2, employing an optimized version of the CART (Classification And Regression Tree) algorithm and Gini coefficient. To avoid overfitting, a minimum of 3 samples was required for a leaf node. After collecting 500 SNP sets and their corresponding MCC scores, the SNP sets were ranked based on the average of their minimum MCC and median MCC scores. The top-25 SNP sets were then subjected to stratified 10-fold cross-validation steps to select the best SNP set by ensuring that the selected SNP set would not be over-fitting to the full dataset. We denoted the best SNP set from this approach as GEO55 since the set consisted of 55 SNPs.

SNP selection by F ST
Naive Bayes works best with features that can distinguish between classes. Hence, SNPs that have unique distinct alleles which can differentiate a population against the other population are the best candidates to be used with Naive Bayes, regardless of the biological properties of the SNPs. Such SNPs are usually very low diversity or even monomorphic within the population, and hence would not be suitable for fingerprinting purposes. One of the metrics in genomic data that can be used to infer unique alleles for differentiating sub populations is F ST .

Whole F ST approach
A straightforward method to employ F ST is by selecting SNPs with highest F ST for each population within the total dataset, or whole F ST of each population. To illustrate, with the previously described 4-population dataset, the SNP selection would follow the following pseudocode: The above procedure would yield 4 SNPs to be used as the feature set for training and prediction. If desired, 2 SNPs with highest F ST for each step can be selected to ensure the redundancy of the SNP set in case there is any missing data in obtaining the SNPs from the experimental procedure.
The problem with this procedure is that the SNP with highest F ST from a population against the rest of populations might not be able to distinguish that particular population from a closely related population. For example, the chosen SNP for Indonesia and the chosen SNP for Thailand might not be able to distinguish between Indonesia and Thailand population since the chosen SNPs would most likely lean toward differentiating Indonesia (or Thailand) against the combined Colombia and Ethiopia.

Pairwise F ST approach
Another method to select SNPs by F ST is to get the highest F ST per pairwise population. With this method, all possible SNPs that can be used to distinguish each population from every other population can be selected. The following pseudocode illustrates the procedure to select the SNPs: The above procedure would yield 6 SNPs to be used as the feature set.
The pairwise F ST approach ensures that there will be at least a SNP that can differentiate a population with closely related populations. However, as the number of populations (countries) increases, the number of selected SNPs increases. For example, if 1 SNP were selected for each pairwise population comparison, then for 15 populations, there would be 105 pairwise comparisons yielding 105 selected SNPs. This defeats the objective to obtain the most parsimonious number of SNPs as required for this study.

Hierarchical F ST (HFST) approach
A more effective method can be devised by combining the whole F ST and pairwise F ST approaches. The Hierarchical F ST (HFST) works by selecting SNPs using the relatedness between each population, which can be inferred by other methods such as using a neighbor-joining population tree generated by Nei's population distance. As an example, assuming that isolates from Indonesia and Thailand were closely related, and that Colombia and Ethiopia were relatively close to each other compared to Indonesia/Thailand isolates, a rooted neighbor-joining population tree constructed from the 4-populations dataset would be: Based on the guide tree above, the following pseudocode illustrates the procedure of Hierarchical F ST : The above procedure would select 3 SNPs to be used as the feature set.
Selecting SNPs with highest F ST might not be the best option as feature sets when combined with the rest of selected SNPs because SNP with the highest F ST for a particular group differentiation, such as population A against B, when combined as a panel, might give detrimental effect to other group differentiation, such as population C against D. To minimize this problem, random selection and iteration is employed to find the best combination. A threshold can be used in HFST and the featured SNPs can be selected randomly based on this threshold value. For example, with a threshold of F ST > 0.95, all SNPs within this threshold can then be chosen randomly. When none of SNPs are within this threshold value in any of the F ST selection steps, a DecisionTree (DT) step can be introduced for this particular step to differentiate between the 2 nodes of the respective branch and the featured SNPs are obtained from the DT feature set. F ST threshold setting can be used to influence the number of selected SNPs since increasing the F ST threshold will increase the likelihood of using DT, and as DT might emit more than 1 SNP for its feature set, the number of selected SNPs will increase as well. In addition, by setting the F ST threshold > 1.0, the HFST method is essentially became a guided DecisionTree.
To illustrate the above concept with the 4-population set, assuming that that there were no SNPs with F ST > certain threshold both for comparison of group (Indonesia vs Thailand) against (Colombia vs Ethiopia) and for Indonesia against Thailand, the procedure would be: By combining random selection and DT method, a stochastic process is introduced to HFST and hence by repeating the HFST process, SNP sets (or feature sets) can be collected and ranked based on their MCC scores.
To create a guide tree in this study, we calculated the genetic distance matrix of all samples in Dataset 3 using all SNPs. Based on the genetic distance matrix, we computed the population genetic distance matrix using Nei's net average number of differences between populations (D A ) 9 . D A between population 1 and 2 is defined as: with D as the raw number of nucleotide differences between populations, defined as: where k and l are the number of distinct haplotypes in population 1 and 2 respectively, x 1i is the frequency of the i-th haplotype in population 1, x 2j is the frequency of the j-th haplotype in population 2, and δ ij is the genetic distance (or proportion of number of allele differences) between haplotype i-th and haplotype j-th.
Alternatively, D can also be computed as: where m and n are the number of samples in population 1 and 2 respectively, and d ij is the genetic distance between sample i in population 1 and sample j in population 2.
Based on the Nei's net average population genetic distance matrix, a neighbor-joining tree was reconstructed using Ape, and the resulted tree was rooted as shown in Supplemental Figure 4. This rooted tree was used as the guide tree for HFST method with arbitrarily chosen F ST threshold of 0.90 and 0.95, denoted as HFST-0.90 and HFST-0.95 respectively. We subjected Dataset 3 to both HFST-0.90 and HFST-0.95 for 500 repeats each using custom Python script which utilized DecisionTree as implemented in sklearn 0.24.2, to obtain 500 SNP sets for each threshold. To avoid overfitting, a maximum tree depth of 2 was set for each of the necessary DT step. The 500 SNP sets of each threshold were ranked based on the average of minimum and median MCC scores. The top-25 SNP sets of each threshold were then subjected to stratified 10-fold cross-validation steps to select the best SNP set by the average of minimum and median MCC scores, resulting in GEO33 (33 SNPs) and GEO50 (50 SNPs) for HFST-0.90 and HFST-0.95, respectively.

Evaluation and Assessment
An overview of the steps involved in the comparative evaluation of the SNP panels is outlined in section c. Assessment and Validation of the flowchart presented in Figure 1. Briefly, Dataset 4 (which contained the same samples as Dataset 3 but only comprised SNPs from BR38 and selected SNP sets) was used for comparative assessment using stratified k-fold cross-validation across all SNP sets and their combination with BR38 (i.e., BR38, GEO33, GEO50, GEO55, GEO33+BR38, GEO50+BR38 and GEO55+BR38). The data missingness assessment was also performed using Dataset 4. Independent validation was undertaken using the Independent Validation Dataset. All metrics were based on MCC scores.

Assessment of the durabililty of candidate SNP panels
To assess the durability of prediction performance of the candidate SNP panels with different levels of missing data (analogous to genotyping failures), repetitive predictions were run after removing genotype data randomly. The BALK classifier was trained with all samples from Dataset 3 against the candidate SNP panels. To prepare the testing sample set, 25 samples were sampled randomly with replacement for each country from Dataset 3 and genotype calls were randomly removed from the sampled data in 10%, 20% and 30% proportions. The random samples with random missing data were then predicted by the trained classifier. This process was run in 250 repeats and the MCC score of the prediction for each country was reported in Figure 3 and Supplementary Table 6.
The following pseudocode illustrates the detailed steps of the above process: Prepare dataset X FOR each snp set SET random_generator_1 with seed Create dataset X' which only contain snps from current snp set Train BALK classifier with dataset X' against current snp set REPEAT FOR each population in dataset X' Use random_generator_1 to random sampling with replacement Collect the 25 samples to dataset Z FOR proportion in 0.1, 0.2, 0. 3 Copy dataset Z to Z' For each sample in dataset Z' Remove random allel with proportion Predict each sample in dataset Z' using trained BALK Calculate MCC scores for each population Collect all MCC scores, proportion and snp set UNTIL 250 repeats Plot MCC scores per population with snp set and missingness proportion

Validation using independent dataset
To evaluate the performance of the candidate SNP panels with new sample sets (as opposed to using the resampling technique of the cross-validation strategy), the BALK classifier was trained with all samples from Dataset 3 against all candidate SNP panels, including their combination with BR38 (i.e., GEO33, GEO50, GEO55, GEO33+BR38, GEO50+BR38 and GEO55+BR38). Each sample from the Independent Validation Dataset was then predicted by the trained BALK classifiers for each panel set. MCC scores for each country were reported as Table 2 and prediction results were plotted as heatmaps in Figure 4.

Web-based Data Analysis Platform for End-Users
To establish accessible informatics tools for end users, an online platform was created incorporating data classification tools for determining the most likely country of origin of a sample using genetic data derived from different barcodes. Existing source code, developed for a microsatellite-based P. vivax data sharing and analysis platform 10 , was modified to create a new web-based platform (vivaxGEN-geo), to analyze SNP data generated at the geographic barcode. This approach was chosen owing to the ability to i) incorporate manual SNP sets allowing incremental improvements of the barcode in future, ii) evaluate barcodes with incomplete data owing to genotyping failures, and iii) evaluate heterozygous genotype calls, which reflect polyclonal infections. For optimal accuracy, the BALK classifier provided on the online platform has been trained with 941 samples, comprising Dataset 2 (N=799) plus the Independent Validation Dataset (N=142). The classifier tool reports the three highest likelihoods for country of origin and their associated probabilities. To provide better estimates of probability from Naïve Bayes classifier, the probabilities were computed using isotonic method as implemented in CalibratedClassfierCV of sklearn library, with stratified 4-fold cross-validation for the calibration dataset 11 . The web platform is able to receive the input data as string-based barcode representation, column-based tab-delimited text files, and VCF files.

Implementation and Web Service Availability
All custom scripts were implemented in Python, except when noted otherwise, and were run under Python 3.9. All custom scripts to process VCF files and other data are available at https://github.com/trmznt/pys. The Bernoulli Naive Bayes (NaNBernoulliNB) and Biallele Likelihood (BialleleLK) classifiers, both with missing data capability, is implemented in Python and available at https://github.com/trmznt/lkc-bisnp. Both classifiers were written in sklearn-type style and could be used directly in a similar way as the rest of classifiers implemented in sklearn. All analysis was performed with sklearn version 0.24.2 under Python 3.9. The version on website had been updated using Python 3.10 and sklearn version 1.1.1. Chromosome  Position  Reference allele  Alternative allele Gene  Amino acid  change  PvP01_03_v1 220999    . This data quality iterative filtering process determined the number of SNPs which had genotype calls in all samples within a sample set and for which the respective SNPs were informative, designated as SNPs with minor allele count (MAC) >= 2 by removing individual sample from highest to lowest levels of missing genotype calls iteratively from the sample set while collecting the number of remaining samples, the number of SNPs without any missing genotype calls, and the number of informative SNPs without any missing genotype calls. Supplementary Figure 3. SNP overlap between the SNP panels. Venn diagram illustrating the number (and percentage) of SNPs overlapping between the GEO33, GEO50 and GEO55 panels. The baseline for the percentages is the total number of SNPs across the three panels. Details of the overlapping SNPs are presented in Supplementary Table 3. For visual simplicity, the BR38 Broad barcode is not presented in the venn diagram as there was no overlap between any of the SNPs in this panel with any of the three new panels.

Supplementary Table 2. Details of the GEO33 barcode markers
Supplementary Figure 5. Rooted neighbor-joining population tree used in HFST. The neighbor-joining tree was constructed using Ape based on a population distance matrix calculated based on Nei's net average population distance from 799 biologically independent samples. iTOL was then used to root the tree at midpoint and to plot the tree.